@JmScherer commented Dec 17, 2025

Please check if the PR fulfills these requirements

  • Tested as per the documentation, and the tests passed
  • Docs have been added / updated (for bug fixes / features)

What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

  • Updated the .gitignore so that DITTO output and work files generated by Nextflow are not committed
  • Updated the root project's README.md to include new instructions, requirements, document linking, and notes
  • Consolidated directory structure for various configuration files into .config subfolders per service and updated code to reflect new paths
  • Added a python=3.10 dependency for the configs/conda/open-cravat.yml conda environment
  • Added a configs/nextflow/local.config file so that Nextflow can use Anaconda when running DITTO on a local device
  • Consolidated shap_plots directory into docs directory
  • Updated pipeline.nf to create the designated output directory, including any missing parent folders
    • The default output folder is $PWD/data/output
  • Updated the HPC Slurm model.job file to write DITTO scores to $PWD/data/output when finished
    • This can be overridden to suit user preference, but it must be a full path; otherwise Nextflow won't be able to access it and save results
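
The output-directory behavior described above can be sketched in shell. This is a minimal illustration and not the code in pipeline.nf or model.job: the function name `ensure_output_dir` is hypothetical, and the full-path check simply mirrors the note that Nextflow needs a full path to save results.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate and create a DITTO output directory.
# Defaults to $PWD/data/output, the default named in this PR.
ensure_output_dir() {
  local out_dir="${1:-$PWD/data/output}"

  # Reject relative paths, since the output location must be a full path.
  case "$out_dir" in
    /*) ;;
    *)
      echo "error: output directory must be a full path: $out_dir" >&2
      return 1
      ;;
  esac

  # -p creates missing parent folders and succeeds if the directory exists.
  mkdir -p "$out_dir" && printf '%s\n' "$out_dir"
}
```

For example, `out_dir="$(ensure_output_dir "$PWD/data/output")"` would create the default location and store it for later use.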

What is the current behavior? (You can also link to an open issue here)

DITTO is difficult to run and this should help streamline the process.

What is the new behavior (if this is a feature change)?

This update is intended to make DITTO more user-friendly and easier to run.

Does this PR introduce a breaking change?

N/A


To Review:

  • Static Code Analysis by Reviewer
  • Clone repo and change to the local-prediction branch
  • Depending on the environment, follow either the HPC Prediction with Cheaha or the Local Prediction instructions and notes below

HPC Prediction with Cheaha

  • Follow the README.md for HPC Prediction with Cheaha

In the root DITTO folder, run tail -f DITTO_logs.out to see the output

Local Prediction Notes:

The reviewer's local device will need access to all the OpenCravat annotators, which take up roughly 600 GB of disk space. Contact the PR assignee for an external drive containing the data if you would like to test this route.

It looks like the OpenCravat conda environment needs to be created and the module path set before DITTO can run. Although the Nextflow pipeline includes a step to set the path, it doesn't seem to take effect.

  • Create OpenCravat conda environment
    • conda create --name opencravat-env
    • conda activate opencravat-env
    • conda env update -n opencravat-env --file ./configs/conda/open-cravat.yaml
    • oc config md /Volumes/my_book/opencravat/modules
    • conda deactivate
  • Follow the README.md for Local Prediction
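
The setup steps above could be collected into a small script. The sketch below is a convenience under stated assumptions and is not part of this PR: the `run` wrapper, the `DRY_RUN` and `OC_MODULES_DIR` variables, and the function name are hypothetical; `--yes` is added so `conda create` does not prompt; and `conda run -n` stands in for the activate/deactivate pair, because `conda activate` needs shell initialization in non-interactive scripts. The modules path is the example path from the steps above and will differ per machine.

```shell
#!/usr/bin/env bash
# Hypothetical setup script for the OpenCravat conda environment.
ENV_NAME="opencravat-env"
# File name as written in the steps above.
ENV_FILE="./configs/conda/open-cravat.yaml"
# Example modules location from the PR notes; override for your machine.
OC_MODULES_DIR="${OC_MODULES_DIR:-/Volumes/my_book/opencravat/modules}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

setup_opencravat_env() {
  # Create the (initially empty) environment; --yes skips the prompt.
  run conda create --yes --name "$ENV_NAME" || return 1
  # Install the dependencies listed in the environment file.
  run conda env update -n "$ENV_NAME" --file "$ENV_FILE" || return 1
  # Point OpenCravat at the annotator modules from inside the environment.
  run conda run -n "$ENV_NAME" oc config md "$OC_MODULES_DIR"
}
```

Setting `DRY_RUN=1` before calling `setup_opencravat_env` prints the three commands without running them, which is a cheap way to check the paths first.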

Local Prediction

.test_data/file_list.txt

Screenshot 2025-12-16 at 3 14 46 PM

Running DITTO locally

Screenshot 2025-12-16 at 3 51 59 PM

DITTO completing locally

Screenshot 2025-12-17 at 9 02 35 AM

HPC Prediction with Cheaha

.test_data/file_list.txt

Screenshot 2025-12-16 at 3 16 14 PM

Running DITTO with HPC

Screenshot 2025-12-16 at 3 15 17 PM

DITTO completing on Cheaha

Screenshot 2025-12-17 at 9 01 42 AM

@JmScherer self-assigned this Dec 17, 2025
@JmScherer added the documentation and enhancement labels Dec 17, 2025
@wilkb777 (Member) left a comment

Just a minor change to make testing the pipeline better/easier (requiring no file modifications out of the box) and corresponding changes to documentation.

pipeline.nf Outdated
workflow {

// Define input channels for the VCF files
vcfFile = Channel.fromPath(params.sample_sheet).splitCsv(header: false)
Member

Suggested change
- vcfFile = Channel.fromPath(params.sample_sheet).splitCsv(header: false)
+ vcfFile = Channel.fromPath(params.sample_sheet)
+     .splitText() // Emit each line as a separate item
+     .map { line ->
+         // Absolute paths are used as-is; relative paths are resolved against workflow.launchDir
+         if (line.startsWith("/")) {
+             return line.trim()
+         } else {
+             def abs_path = file(workflow.launchDir).resolve(line.trim())
+             return abs_path
+         }
+     }
+     .map { path_obj ->
+         // Ensure the path is a proper Path object for staging
+         file(path_obj, checkIfExists: true)
+     }

This enables relative paths for the files listed in the input sample sheet text files. Relative paths are resolved against the launch directory, which is the directory the pipeline.nf workflow was run from. It also lets you delete the part of the README instructions about having to specify full paths for inputs, which cleans up testing nicely. I've already tested that this works locally and on Cheaha (assuming sbatch is run from the repo's root directory), so this can be added directly with no further review.
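
For readers less familiar with Nextflow, the same resolution rule can be illustrated in plain shell. This is only a sketch of the logic (absolute paths pass through, relative paths are resolved against the launch directory, and every path must exist). The function name `resolve_sample_paths` is hypothetical, skipping blank lines is an addition of this sketch, and none of this is the code used by pipeline.nf.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the sample-sheet path rule:
#   resolve_sample_paths <file_list> [launch_dir]
resolve_sample_paths() {
  local file_list="$1"
  local launch_dir="${2:-$PWD}"
  local line path

  # "|| [ -n "$line" ]" keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    # Trim surrounding whitespace before testing the path.
    line="$(printf '%s' "$line" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')"
    # Skip blank lines (an addition in this sketch).
    [ -z "$line" ] && continue

    case "$line" in
      /*) path="$line" ;;              # absolute: use as-is
      *)  path="$launch_dir/$line" ;;  # relative: resolve against launch dir
    esac

    # Fail on a missing file, similar in spirit to checkIfExists: true.
    if [ ! -e "$path" ]; then
      echo "error: input file not found: $path" >&2
      return 1
    fi
    printf '%s\n' "$path"
  done < "$file_list"
}
```

Running `resolve_sample_paths .test_data/file_list.txt` from the repo root would print one resolved path per line, or stop at the first entry that does not exist.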

Author

Thanks for this! This will simplify the process further.

Author

Updated the README.md to reflect this change. The pipeline now supports both absolute and relative paths for vcf.gz files in the file list.

README.md Outdated
Comment on lines 149 to 150
- Update the `.test_data/file_list.txt` (inout vcfs) files with complete file paths and submit a slurm job using the
command below
Member

Suggested change
- Update the `.test_data/file_list.txt` (inout vcfs) files with complete file paths and submit a slurm job using the
command below
- Create a text file listing the path to VCF file(s) (1 path per line) with variants to score
- Paths can be full absolute paths **or** relative paths (relative to the directory where the pipeline will be run from, **not** the directory where the `pipeline.nf` file is)
- See the example input file [.test_data/file_list.txt](.test_data/file_list.txt) (lists 2 testing example input VCFs)
for reference or as an input file for testing (default behavior of `model.job`)

Updated this text to be more explicit about what to provide as input (see the suggestion on supporting relative paths for input files in pipeline.nf).
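
One way to produce such a file list is sketched below, under assumptions: the helper `make_file_list` is hypothetical and not part of the repo, and the `*.vcf.gz` pattern follows the file type mentioned earlier in this thread. It writes one path per line, which is the format described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write every *.vcf.gz under a directory to a file list,
# one path per line.
#   make_file_list <vcf_dir> <output_list>
make_file_list() {
  local vcf_dir="$1"
  local out_list="$2"

  # Paths are emitted as given: pass a relative vcf_dir for relative paths
  # or an absolute vcf_dir for absolute paths.
  find "$vcf_dir" -type f -name '*.vcf.gz' | sort > "$out_list"
}
```

Because relative entries are resolved against the directory the pipeline is launched from, a relative `vcf_dir` should be given relative to that same directory.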

Author

In the README.md, I provided real examples of relative and absolute paths in addition to this clarification.

JmScherer and others added 4 commits December 18, 2025 14:06
Co-authored-by: Brandon M Wilk <[email protected]>
Co-authored-by: Brandon M Wilk <[email protected]>
Co-authored-by: Brandon M Wilk <[email protected]>
Co-authored-by: Brandon M Wilk <[email protected]>
@sdhutchins (Member) left a comment

My comments are mainly suggestions and not requirements. This is good to go after @wilkb777's suggested changes go in.

…ile_list.txt for pathing on vcfs for DITTO; Updated README to discuss the use of Mamba for NextFlow
…throwing errors. We suspect the version it pulled automatically was too new for the pipeline
4 participants